Abstract

The paper deals with the problem of finding the best alternatives 
on the basis of pairwise comparisons when these comparisons need not 
be transitive. In this setting, we study a reinforcement urn model. 
We prove convergence to the optimal solution when reinforcement of 
a winning alternative occurs each time after considering three random 
alternatives. The simpler process, which reinforces the winner of a
random pair, does not always converge: it may cycle.

1 Introduction 

In a pairwise comparison problem, one is given a set of alternatives, together
with data about how they compare with one another. In its purest form, on
which we focus in the present paper, we simply have, for any pair of distinct
alternatives, the information of which one "beats" the other. Such a data
set is called a tournament. Basic results on this structure can be found in
Moon [18].

For logical as well as practical reasons, binary relations are at the basis 
of choice theory. Two classical examples are sport competition and majori- 
tarian politics. Many sports involve by definition two players (or teams), 
so that competition among any number of players must take the form of
the aggregation of pairwise comparisons. In majority voting, a candidate is
socially preferred to another if and only if a majority of the voters prefer the 
former to the latter. More generally, the prevalence of that kind of binary 
relations can be traced back to specific features of efficient natural languages 
(Rubinstein [25]). 

If a chess player beats all the other players, he or she is clearly the best. 
If a candidate cannot be defeated under majority rule by any challenger, 
that "Condorcet" candidate can claim to be the best according to majority 
rule.^1 But if no alternative beats all the others, it is not clear how to define
the best alternatives. The problem of choosing from pairwise comparisons
has thus attracted the attention of scholars in various fields, most often from
the axiomatic, normative point of view (David [5]; Fishburn [6]; Laslier [13];
Brandt et al. [1]).

In the present paper we tackle the same problem from an evolutionary 
perspective instead of an axiomatic one. We consider dynamic processes 
according to which, at each period in time, a small number (2 or 3) of alter- 
natives are sampled, the tournament is played among these few alternatives, 
and the winning alternative is reinforced in the sense that it will be sam- 
pled with higher probability in the future. Where does such a mechanical 
adaptive process go? Using a standard urn model, where reinforcing an 
alternative is adding a colored ball to an urn, we obtain two results. 

(i) If one samples three alternatives (distinct or not) at each date, the 
process is able to discover the optimal solution of the tournament, that is 
the unique probability distribution p* which is, in expectation, defeated by 
no alternative. With probability one, the composition of the sampling urn, 
which defines the probability $p_\tau$ of choosing the various alternatives at time
$\tau$, tends to p* when $\tau$ tends to infinity.

(ii) If one samples only two alternatives at each date, the process is not
able to discover the optimal solution, unless the solution is degenerate,
with one alternative defeating all the others. With probability one, the
composition of the sampling urn, which gives the probability $p_\tau$ of choosing
the alternatives, concentrates on the support of the optimal solution p*.
This means that all the alternatives which are played with zero probability 
in the optimal solution are chosen with a probability going to 0. However the 
composition of the urn may cycle, staying away from the optimal solution. 
In some cases we even prove that it cycles with probability one. 

The negative result (ii) echoes known results about the evolutionary
instability of mixed equilibria in evolutionary game theory. For instance,
cycling with probability one is proven by Posch [23] in a reinforcement urn
model for 2x2 games.

^1 This observation does not imply that the majority principle is good for Politics.

The positive result (i) seems more original. In a study of imitation
processes in Matching Pennies games, Hofbauer and Schlag [11] observe that
players end up closer to the equilibrium if they sample several individuals
before imitating: there is still cycling, but closer to the equilibrium. Our
results might be interpreted in the same spirit: learning more slowly leads to
more stability.

The techniques we use to derive these results are standard in the field
of adaptive processes with reinforcement (see Pemantle [22]), and they
belong to the family of martingale techniques. The main ingredient of the
proof is the definition of a well-chosen function of the process whose values
form a martingale (see (11)). We use the convergence theorem for positive
martingales to obtain some global asymptotic information about the process.
In the case of three alternatives we get the convergence of the process fairly
directly, while for the case of two alternatives the convergence theorem has
to be complemented with a variance analysis to prove the non-convergence.

The paper is organized as follows. Section 2 introduces the necessary
notions about tournaments: definition and notation (2.1), the Markov chain
induced by the play of small-size tournaments at each date (2.2), the
tournament game, which allows us to define the optimal solution and to prove
its existence (2.3), and some further preliminary material (2.4, 2.5). Section 3
starts with the definition of urns and of the adaptive processes (3.1). Then,
in order to illustrate the argument in a simple way, a toy example is
introduced and treated according to its deterministic approximation (3.2). The
statement and proof of our main result on three-alternatives reinforcement are
found in (3.3), and two-alternatives reinforcement is treated in (3.4), before
a short conclusion (3.5).

2 Framework 
2.1 Tournaments 

Let X be a finite set. A tournament T on X is a complete and antisymmetric
binary relation. For any x and y in X, one and only one of the three
possibilities occurs: x = y, xTy, or yTx. When xTy we often say that x
beats y. Define the sets:

$$T^+(x) = \{y \in X : xTy\}, \qquad T^-(x) = \{y \in X : yTx\}. \qquad (1)$$

The binary relation T is fixed throughout this paper. It is sometimes easier
to use the notation:

$$\max\{x, y\} = \begin{cases} x & \text{if } xTy \text{ or } x = y, \\ y & \text{if } yTx. \end{cases}$$

An alternative which beats all other alternatives is called a Condorcet winner.
A tournament can have a Condorcet winner or not, but cannot have
two. The Top-Cycle of the tournament T is the smallest (by inclusion) set
$Y \subseteq X$ such that:

$$\forall x \in X \setminus Y,\ \exists y \in Y : yTx.$$

It is easily seen that such a set is unique and reduces to a singleton {c} 
if and only if c is a Condorcet winner. The literature on tournaments and 
formal political science has shown that the Top-Cycle is usually a very large 
set (McKelvey [13 )i and has proposed many refinements of this set (see [l^ 
for a survey). 
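The notions of this subsection can be made concrete with a small sketch. The four-alternative tournament below (a cycle A, B, C plus an alternative D that loses to everyone) and the function names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

# Hypothetical example: a cycle A > B > C > A, plus D which loses to everyone.
X = ["A", "B", "C", "D"]
BEATS = {("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")}


def is_tournament(alts, beats):
    """Complete and antisymmetric: exactly one of xTy, yTx for each distinct pair."""
    no_self_loops = all((x, x) not in beats for x in alts)
    one_direction = all(((x, y) in beats) != ((y, x) in beats)
                        for x, y in combinations(alts, 2))
    return no_self_loops and one_direction


def t_plus(x, alts, beats):
    """T+(x): the alternatives that x beats."""
    return {y for y in alts if (x, y) in beats}


def t_minus(x, alts, beats):
    """T-(x): the alternatives that beat x."""
    return {y for y in alts if (y, x) in beats}


def winner(x, y, beats):
    """max{x, y}: x if x = y or xTy, otherwise y."""
    return x if x == y or (x, y) in beats else y


def condorcet_winner(alts, beats):
    """Return the alternative that beats all the others, or None if there is none."""
    for x in alts:
        if t_plus(x, alts, beats) == set(alts) - {x}:
            return x
    return None
```

In this example no alternative beats all the others, so `condorcet_winner(X, BEATS)` returns `None`; restricting the same relation to `["A", "B", "D"]` makes A a Condorcet winner.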



2.2 A Markov chain 

Let $\Delta(X)$ be the set of probability distributions on X and let $p \in \Delta(X)$.
The support of p is denoted by Supp(p). Given p, define a sequence $(p^{[t]})_{t \in \mathbb{N}}$
of probability distributions on X derived from p in the following way:

$$p^{[1]} = p, \qquad (2)$$
$$p^{[t+1]}(x) = p^{[t]}(x) \cdot p(T^+(x) \cup \{x\}) + p^{[t]}(T^+(x)) \cdot p(x), \qquad (3)$$

for any $t \ge 1$ and any $x \in X$.

The interpretation is that $p^{[t]}$ is the distribution of a random variable
$\xi(t) \in X$ such that $\xi(1)$ is chosen at random according to p and then, given
that $\xi(t) = x$, $\xi(t+1)$ is the winner (according to T) of the comparison
between x and some alternative y randomly chosen in X according to p.
Therefore $\xi(t+1) = x$ either because $\xi(t)$ was already equal to x and y was
chosen in $T^+(x) \cup \{x\}$ (first term in the above formula), or because $\xi(t)$ was
in $T^+(x)$ and x was chosen according to p (second term). We call p the
"sampling" probability.

This process is usually considered with p uniform on X (Daniels [1], 
Ushakov [Mj, Levchenkov [16], Slutzky and Volij [27], Chebotarev and 
Shamis [21 E]). We need the general version because, later in this paper, 
p will be endogenous.

Given p, the stationary distribution for this finite Markov chain exists
and is unique;^2 we denote it by $p^{[\infty]}$. It is characterized by the fact that
$\mathrm{Supp}(p^{[\infty]}) \subseteq \mathrm{Supp}(p)$ and, for any x in Supp(p),

$$p^{[\infty]}(T^+(x)) = \frac{p^{[\infty]}(x)}{p(x)} \cdot p(T^-(x)). \qquad (4)$$

Notice that the inclusion $\mathrm{Supp}(p^{[\infty]}) \subseteq \mathrm{Supp}(p)$ may be strict; indeed,
$p^{[\infty]}(x) = 0$ when $p(T^+(x)) = 0$, that is when x beats no alternative in
the support of p. More exactly, $\mathrm{Supp}(p^{[\infty]})$ is the Top-Cycle of the restriction
of T to Supp(p); thus $\mathrm{Supp}(p^{[\infty]})$ does not really depend on
p but only on Supp(p). If p has full support, for instance in the usual case
where p is uniform, $\mathrm{Supp}(p^{[\infty]}) = TC(T)$.
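As an illustration, the recursion of this subsection can be iterated numerically and the balance relation (4) checked on the limit. This is only a sketch: the tournament, the sampling probability `P`, the number of iterations and the function names are assumptions chosen for the example, and the second term of the update uses the set T+(x) of alternatives beaten by x, as in the interpretation given above.

```python
# Hypothetical example: a cycle A > B > C > A, plus D which loses to everyone.
X = ["A", "B", "C", "D"]
BEATS = {("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")}
P = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}   # an arbitrary sampling probability


def step(q, p, alts, beats):
    """One application of the recursion: q is the current law, p the sampling law."""
    nxt = {}
    for x in alts:
        beaten = [y for y in alts if (x, y) in beats]          # T+(x)
        stay = q[x] * (sum(p[y] for y in beaten) + p[x])       # x was there, y in T+(x) or y = x
        arrive = sum(q[y] for y in beaten) * p[x]              # x is sampled and beats the incumbent
        nxt[x] = stay + arrive
    return nxt


def stationary(p, alts, beats, iterations=2000):
    """Approximate the stationary law by iterating the recursion from p."""
    q = dict(p)
    for _ in range(iterations):
        q = step(q, p, alts, beats)
    return q


def satisfies_balance(p, q, alts, beats, tol=1e-9):
    """Check q(T+(x)) = q(x) / p(x) * p(T-(x)) for every x in the support of p."""
    for x in alts:
        lhs = sum(q[y] for y in alts if (x, y) in beats)
        rhs = q[x] / p[x] * sum(p[y] for y in alts if (y, x) in beats)
        if abs(lhs - rhs) > tol:
            return False
    return True


LIMIT = stationary(P, X, BEATS)
```

Here D beats nothing, so its stationary mass vanishes, in line with the remark that the support of the stationary distribution can be strictly smaller than Supp(p).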

2.3 The tournament game 

The tournament game is the two-player, symmetric, zero-sum game defined
by the strategy set X and the payoff function g(x, y) = +1 if xTy, g(x, y) = 0
if x = y, and g(x, y) = -1 if yTx. For $p, q \in \Delta(X)$ two probability
distributions on X, write:

$$g(p, q) = \sum_{x \in X} \sum_{y \in X} g(x, y)\, p(x)\, q(y). \qquad (5)$$

From the definition, g is clearly antisymmetric: g(q, p) = -g(p, q).

The tournament game has been studied by graph theorists (Fisher and
Ryan [8], [9], [10]) and has more recently attracted the attention of computer
scientists (Rivest and Chen [24]). As a model of majority voting and two-party
electoral competition, it is studied in Social Choice theory and formal Political
Science (Moulin [19], Myerson [20], [21], Laslier [14], [15]). Remarkably, such a
game has a unique equilibrium. Here is the precise result that will be needed
in the sequel. (Fisher and Ryan [9] prove this result using linear algebra and
Laffond et al. [12] have a direct proof using a parity argument.)

Proposition 1 There exists a unique $p^* \in \Delta(X)$ such that $g(p^*, q) \ge 0$ for
all $q \in \Delta(X)$. This p*, called the optimal strategy, is also characterized by
the following: for all $x \in X$,

$$p^*(x) > 0 \iff g(x, p^*) = 0,$$
$$p^*(x) = 0 \iff g(x, p^*) < 0.$$

^2 We state the results in this section without proofs. They are easily derived from
elementary theory of finite Markov chains and have already been noticed for p uniform in
the mentioned references.

The support of the optimal strategy is called the Bipartisan Set of the
tournament: Supp(p*) = BP(T). This set is a subset of the Top Cycle and
the inclusion is often strict. For instance, in totally random tournaments,
the Top Cycle contains all the alternatives and the Bipartisan Set contains
only half of them (Fisher and Reeves [7]).
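The characterization in Proposition 1 gives a simple way to test whether a candidate distribution is the optimal strategy. The sketch below uses exact rational arithmetic; the example tournament and the names `g`, `g_against` and `is_optimal` are assumptions made for illustration.

```python
from fractions import Fraction as F

# Hypothetical example: a cycle A > B > C > A, plus D which loses to everyone.
X = ["A", "B", "C", "D"]
BEATS = {("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")}


def g(x, y, beats):
    """Payoff of the tournament game: +1 if xTy, 0 if x = y, -1 if yTx."""
    if x == y:
        return 0
    return 1 if (x, y) in beats else -1


def g_against(x, q, alts, beats):
    """g(x, q): expected payoff of the pure strategy x against the mixed strategy q."""
    return sum(g(x, y, beats) * q[y] for y in alts)


def is_optimal(p, alts, beats):
    """Test the two conditions of Proposition 1 for a candidate p."""
    for x in alts:
        payoff = g_against(x, p, alts, beats)
        if p[x] > 0 and payoff != 0:
            return False
        if p[x] == 0 and payoff >= 0:
            return False
    return True


CANDIDATE = {"A": F(1, 3), "B": F(1, 3), "C": F(1, 3), "D": F(0)}
UNIFORM = {x: F(1, 4) for x in X}
```

In this example the candidate that is uniform on the cycle {A, B, C} passes the test, so the Bipartisan Set is {A, B, C}, while the uniform distribution on all four alternatives does not.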

2.4 Two formulas 

Before we go further and explain the relation between the game optimal
strategy and stationary probabilities, it is useful to state two technical formulas.
The following lemma describes the probabilities $p^{[2]}$ and $p^{[3]}$, obtained
after sampling two or three alternatives with the Markov chain defined in
Section 2.2, in terms of the payoff function g.

Lemma 2 For any $x \in X$:

$$p^{[2]}(x) = p(x) \cdot (1 + g(x, p)),$$
$$p^{[3]}(x) = p(x) \cdot \Big(1 + \tfrac{3}{2}\, g(x, p) + \tfrac{1}{2}\, g(x, p)^2 + \tfrac{1}{2} \sum_{y \in X} p(y)\, g(x, y)\, g(y, p)\Big).$$

Proof. First let us notice a useful equality. By definitions (1) and (5),

$$g(x, p) = p(T^+(x)) - p(T^-(x)), \qquad (6)$$

and, since $p(T^+(x)) + p(T^-(x)) + p(x) = 1$, we get:

$$1 + g(x, p) = 2p(T^+(x)) + p(x). \qquad (7)$$

Let a and b be chosen according to p and let x = max{a, b}, then:

$$p^{[2]}(x) = \Pr[a = x] \cdot \Pr[b \in T^+(x) \cup \{x\}] + \Pr[a \in T^+(x)] \cdot \Pr[b = x]$$
$$= p(x) \cdot (2p(T^+(x)) + p(x))$$
$$= p(x) \cdot (1 + g(x, p)).$$

For the second formula, let a, b and c be chosen according to p. An
alternative x appears as x = max{max{a, b}, c} in the two exclusive cases:

x = max{a, b} and x = max{x, c};
x T max{a, b} and x = c.

In the first line, the event x = max{a, b} has probability $p^{[2]}(x) = p(x) \cdot (1 + g(x, p))$,
so the probability of the first case is $p(x) \cdot (1 + g(x, p)) \cdot p(T^+(x) \cup \{x\})$.
In the second line, the event x T max{a, b} has probability

$$p^{[2]}(T^+(x)) = \sum_{y \in T^+(x)} p^{[2]}(y) = \sum_{y \in T^+(x)} p(y)(1 + g(y, p)),$$

therefore the probability $p^{[3]}(x)$ is:

$$p^{[3]}(x) = p(x)(1 + g(x, p)) \Big(p(x) + \sum_{y \in T^+(x)} p(y)\Big) + p(x) \sum_{y \in T^+(x)} p(y)(1 + g(y, p))$$
$$= p(x)^2\, [1 + g(x, p)] + p(x) \sum_{y \in T^+(x)} p(y)\, [(2 + g(x, p)) + g(y, p)].$$

Using the fact that $\frac{1 + g(x, y)}{2}$ is 1 if $y \in T^+(x)$, is 1/2 if y = x, and is 0 if not,
one finds:

$$\frac{p^{[3]}(x)}{p(x)} = \sum_y p(y)\, [(2 + g(x, p)) + g(y, p)]\, \frac{1 + g(x, y)}{2}$$
$$= \tfrac{1}{2} \sum_y p(y)\, [(2 + g(x, p)) + g(y, p) + 2g(x, y) + g(x, p)g(x, y) + g(y, p)g(x, y)]$$
$$= 1 + \tfrac{1}{2} g(x, p) + \tfrac{1}{2} g(p, p) + g(x, p) + \tfrac{1}{2} g(x, p)^2 + \tfrac{1}{2} \sum_y p(y)\, g(y, p)\, g(x, y)$$
$$= 1 + \tfrac{3}{2} g(x, p) + \tfrac{1}{2} g(x, p)^2 + \tfrac{1}{2} \sum_y p(y)\, g(y, p)\, g(x, y),$$

which is the announced formula. Q.E.D.
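The two formulas of Lemma 2 can be checked against a brute-force enumeration of all ordered samples. The sketch below does this with exact fractions; the tournament, the sampling probability `P` and the function names are illustrative assumptions, and the coefficients 3/2, 1/2 and 1/2 are those obtained in the last line of the proof.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical example: a cycle A > B > C > A, plus D which loses to everyone.
X = ["A", "B", "C", "D"]
BEATS = {("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")}
P = {"A": F(1, 2), "B": F(1, 4), "C": F(1, 8), "D": F(1, 8)}


def g(x, y):
    return 0 if x == y else (1 if (x, y) in BEATS else -1)


def g_against(x, q):
    """g(x, q) for a pure strategy x and a mixed strategy q."""
    return sum(g(x, y) * q[y] for y in X)


def winner_law(p, k):
    """Exact law of max{...max{a1, a2}..., ak} for a1, ..., ak drawn i.i.d. from p."""
    law = {x: F(0) for x in X}
    for draw in product(X, repeat=k):
        w, prob = draw[0], p[draw[0]]
        for a in draw[1:]:
            prob *= p[a]
            if (a, w) in BEATS:      # the newcomer replaces the incumbent only if it beats it
                w = a
        law[w] += prob
    return law


def lemma2_two(p):
    return {x: p[x] * (1 + g_against(x, p)) for x in X}


def lemma2_three(p):
    out = {}
    for x in X:
        gx = g_against(x, p)
        cross = sum(p[y] * g(x, y) * g_against(y, p) for y in X)
        out[x] = p[x] * (1 + F(3, 2) * gx + F(1, 2) * gx ** 2 + F(1, 2) * cross)
    return out
```

With exact arithmetic, `winner_law(P, 2) == lemma2_two(P)` and `winner_law(P, 3) == lemma2_three(P)` can be compared directly as dictionaries.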



2.5 Relation between optimal strategies and stationary prob- 
abilities 

We first observe that the game optimal strategy p* satisfies a nice fixed-point
property if we take $p^{[1]} = p^*$ as the sampling probability to build the
Markov chain, and that only an optimal strategy can be such a fixed point.

Proposition 3 Let p* be the optimal strategy for the tournament game,
then $p^{*[2]} = p^{*[\infty]} = p^*$. Conversely, let p be such that $p^{[2]} = p$, then p is
the optimal strategy for the tournament game restricted to the support of p.

Proof. By Lemma 2, $p^{*[2]}(x) = p^*(x)(1 + g(x, p^*))$ and, by Proposition 1,
either $p^*(x) = 0$ or $g(x, p^*) = 0$.

Conversely, if $p^{[2]}(x) = p(x) = p(x)(1 + g(x, p))$ then $g(x, p) = 0$ as soon
as $p(x) > 0$, and p is the optimal strategy on its support. Q.E.D.



3 Learning 

With the previous background material in mind, we turn to the main result 
of this paper. Instead of considering re-sampling at each date according to 
a constant probability distribution, as is done in the previously described 
Markov chains, we describe learning processes where winning alternatives 
are reinforced at the level of the sampling probability. These processes can 
be implemented with random urns. 



3.1 Choice by reinforcement 

An urn on X is a list n of strictly positive integers n(x), $x \in X$. The integer
n(x) is the "number of balls of color x in the urn n." The set of such urns
on X is denoted by $\mathcal{N}$; formally:

$$\mathcal{N} = \mathbb{N}^X.$$

To each $n \in \mathcal{N}$ is associated the probability distribution $\bar n$ on X defined by

$$\bar n(x) = \frac{n(x)}{\sum_{y \in X} n(y)}.$$

When we write that the alternative x is picked in the urn n, we mean that
x is picked in X according to the probability $\bar n$.

A random urn sequence is a sequence $n_\tau$, $\tau \in \mathbb{N}$, of random variables on
$\mathcal{N}$ such that $n_{\tau+1}$ is defined conditionally on $n_\tau$. Here are three examples:

1. Two-alternatives reinforcement. Given a realization $n_\tau \in \mathcal{N}$ of the urn, an
alternative x is picked in X according to the probability distribution
$\bar n_\tau^{[2]}$, and one ball of color x is added to the urn: $n_{\tau+1}(x) = n_\tau(x) + 1$
and, for all $v \ne x$, $n_{\tau+1}(v) = n_\tau(v)$. This means that two alternatives,
say a and b, are picked independently in the urn $n_\tau$ and are compared
according to T. The result of the comparison is x = max{a, b}, that is:
x = a if a = b or if aTb, and x = b if bTa. Alternative x is reinforced.

2. Three-alternatives reinforcement. Same thing as above, with the probability
distribution $\bar n_\tau^{[3]}$. This means that three alternatives, say a,
b and c, are picked independently in X according to $\bar n_\tau$; a, b and
c are compared according to T in sequence and one ball of color
x = max{max{a, b}, c} is added to the urn. Remark that there are
only two cases: ranked alternatives, where we reinforce the top one, or
a cycle, where we reinforce at random.

3. Fast reinforcement. Same thing as above, with the probability distribution
$\bar n_\tau^{[\infty]}$, the stationary distribution for T when sampling is done
according to $\bar n_\tau$.

Remark that the first two examples can be concretely implemented easily,
as described, but fast reinforcement cannot.
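The first two processes can be simulated directly from their description. The following is a minimal sketch, assuming draws with replacement from the current urn and Python's standard pseudo-random generator; the function names, the seed and the example cycle are illustrative choices, not part of the paper.

```python
import random

# Hypothetical example: the three-cycle A > B > C > A.
CYCLE = {("A", "B"), ("B", "C"), ("C", "A")}


def winner(x, y, beats):
    """max{x, y}: x if x = y or xTy, otherwise y."""
    return x if x == y or (x, y) in beats else y


def reinforce(urn, beats, k, rng):
    """Draw k balls independently from the urn, play them in sequence, add one ball for the winner."""
    colors = list(urn)
    draws = rng.choices(colors, weights=[urn[c] for c in colors], k=k)
    w = draws[0]
    for a in draws[1:]:
        w = winner(w, a, beats)
    urn[w] += 1
    return w


def simulate(urn0, beats, k, steps, seed=0):
    """Run k-alternatives reinforcement (k = 2 or 3) and return the final ball counts."""
    rng = random.Random(seed)
    urn = dict(urn0)
    for _ in range(steps):
        reinforce(urn, beats, k, rng)
    return urn


def proportions(urn):
    """The probability distribution associated with an urn."""
    total = sum(urn.values())
    return {c: n / total for c, n in urn.items()}


FINAL = simulate({"A": 1, "B": 1, "C": 1}, CYCLE, k=3, steps=500, seed=1)
```

Calling `proportions(simulate(urn0, CYCLE, k, steps))` for k = 2 and k = 3 with increasing `steps` gives an empirical view of the composition of the urn; a single run is of course only one realization of the random sequence.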



3.2 A motivational example 

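This section presents a simple non-rigorous argument to justify our focus on
three-alternatives reinforcement. Consider the simplest possible non-trivial
tournament: a cycle of three alternatives A, B and C with A T B, B T C,
C T A. In order to evaluate the long-term behavior of two-alternatives and
three-alternatives reinforcement, we use a deterministic continuous-time motion
corresponding to the limit of a large number of balls in the urn. We
write a(t), b(t) and c(t) for the "number" of balls of each type and $\bar a(t) = a(t)/t$,
$\bar b(t) = b(t)/t$ and $\bar c(t) = c(t)/t$ for the corresponding probabilities.
For two-alternatives reinforcement we get:

$$a' = \bar a^2 + 2\bar a \bar c, \qquad b' = \bar b^2 + 2\bar b \bar a, \qquad c' = \bar c^2 + 2\bar c \bar b, \qquad (8)$$

and we remark that

$$\frac{d}{dt}\big(\ln \bar a + \ln \bar b + \ln \bar c\big) = \frac{d}{dt}\big(-3 \ln t + \ln a + \ln b + \ln c\big)$$
$$= -\frac{3}{t} + \frac{\bar a^2 + 2\bar a \bar c}{a} + \dots$$
$$= -\frac{3}{t} + \frac{\bar a + 2\bar c}{t} + \frac{\bar b + 2\bar a}{t} + \frac{\bar c + 2\bar b}{t}$$
$$= 0,$$

so $(\bar a, \bar b, \bar c)$ cannot converge to the optimal probability independently of the
state at finite time.

It is then natural to study three-alternatives reinforcement, for which:

$$a' = \bar a^3 + 3\bar a^2 \bar c + 3\bar a \bar c^2 + 2\bar a \bar b \bar c,$$
$$b' = \bar b^3 + 3\bar b^2 \bar a + 3\bar b \bar a^2 + 2\bar a \bar b \bar c, \qquad (9)$$
$$c' = \bar c^3 + 3\bar c^2 \bar b + 3\bar c \bar b^2 + 2\bar a \bar b \bar c,$$

where the last term accounts for the six orderings of three distinct alternatives,
each alternative winning in two of them, so that the three right-hand sides sum to
$(\bar a + \bar b + \bar c)^3 = 1$. For the same quantity we get

$$\frac{d}{dt}\big(\ln \bar a + \ln \bar b + \ln \bar c\big) = -\frac{3}{t} + \frac{\bar a^2 + 3\bar a \bar c + 3\bar c^2 + 2\bar b \bar c}{t} + \dots$$
$$= \frac{4(\bar a^2 + \bar b^2 + \bar c^2) + 5(\bar a \bar c + \bar b \bar a + \bar c \bar b) - 3}{t}$$
$$= \frac{\bar a^2 + \bar b^2 + \bar c^2 - (\bar a \bar c + \bar b \bar a + \bar c \bar b)}{t},$$

where the last line uses $3 = 3(\bar a + \bar b + \bar c)^2$. Simple calculus shows that this
last term is positive except for $\bar a = \bar b = \bar c = 1/3$. Then $\ln \bar a + \ln \bar b + \ln \bar c$
is an increasing negative function so it converges. It is not difficult to see,
using the divergence of $\int dt/t$, that this implies that
$\bar a^2 + \bar b^2 + \bar c^2 - \bar a \bar c - \bar b \bar a - \bar c \bar b$ converges to 0 and then that $(\bar a, \bar b, \bar c)$
converges to (1/3, 1/3, 1/3) (the details of the arguments will be given in
the rigorous proof of the next section).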

With this example we can see that two-alternatives reinforcement should
not converge to the optimal probability, even for a simple tournament and
when we neglect the effect of probabilistic noise, while three-alternatives
reinforcement seems to work in that case. In the next section we will prove
that three-alternatives reinforcement actually converges for any tournament.
We will use the same idea of computing the variation of $\ln \bar a + \ln \bar b + \ln \bar c$, with
technical changes for the general tournament, the discrete time and the
probabilistic evolution.



3.3 Three-alternatives reinforcement and martingale technique

We will now prove the result about three-alternatives reinforcement:

Theorem 4 For any initial urn $n_0 \in \mathcal{N}$, the random urn sequence obtained
by three-alternatives reinforcement is such that the realization $n_\tau$, $\tau \in \mathbb{N}$,
almost surely verifies:

$$\lim_{\tau \to \infty} \bar n_\tau = p^*.$$

The same proof will actually also give the first part of the result about
two-alternatives reinforcement, which we thus state now:

Theorem 5 For any initial urn $n_0 \in \mathcal{N}$, the random urn sequence obtained
by two-alternatives reinforcement is such that the realization $n_\tau$, $\tau \in \mathbb{N}$,
almost surely verifies:

$$\forall x \in X, \quad p^*(x) = 0 \implies \lim_{\tau \to \infty} \bar n_\tau(x) = 0.$$

The proof relies mainly on the study of a well-chosen function of the state
of the urn. Let LD denote a discrete logarithm: for integers $0 < a < b$,

$$LD[a, b] = -\sum_{i=a}^{b-1} \frac{1}{i}. \qquad (10)$$

Recall that at time $\tau \in \mathbb{N}$, $n_\tau(w)$ denotes the number of w-balls in the urn.
The total number of balls is increasing by 1 at each time, so $\sum_w n_\tau(w) = A + \tau$,
where A is the number of balls in the initial urn. The probability of drawing
a w-ball is $\bar n_\tau(w) = n_\tau(w)/(A + \tau)$.
Consider the quantity

$$\mu_\tau = \sum_{w \in X} LD[n_\tau(w), A + \tau] \cdot p^*(w), \qquad (11)$$

that is the expected value, according to the optimal probability p*, of the
discrete logarithm at time $\tau$.

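Proposition 6 For both two-alternatives and three-alternatives reinforcement,
the sequence $\mu_\tau$, $\tau \in \mathbb{N}$, is a negative sub-martingale. More precisely
we have, for two-alternatives reinforcement:

$$E[\mu_{\tau+1} - \mu_\tau \mid n_\tau] = \frac{g(p^*, \bar n_\tau)}{A + \tau},$$

and for three-alternatives reinforcement:

$$E[\mu_{\tau+1} - \mu_\tau \mid n_\tau] = \frac{1}{A + \tau} \Big( g(p^*, \bar n_\tau) + \tfrac{1}{2} \sum_{w \in X} g(w, \bar n_\tau)^2\, p^*(w) + \tfrac{1}{2} \sum_{v \in X} \bar n_\tau(v)\, g(p^*, v)\, (1 + g(v, \bar n_\tau)) \Big).$$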

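Proof. We will write p for $\bar n_\tau$ and let i denote either 2 or 3. From $\tau$ to
$\tau + 1$, one and only one ball is added. This ball has type w with probability
$p^{[i]}(w)$. Thus:

$$E[\mu_{\tau+1} - \mu_\tau \mid n_\tau] = \sum_{w \in X} p^{[i]}(w) \Big[ \Big( \frac{1}{n_\tau(w)} - \frac{1}{A + \tau} \Big) p^*(w) - \sum_{v \ne w} \frac{p^*(v)}{A + \tau} \Big]$$
$$= \sum_{w \in X} p^{[i]}(w)\, \frac{p^*(w)}{n_\tau(w)} - \frac{1}{A + \tau}$$
$$= \frac{1}{A + \tau} \sum_{w \in X} \frac{p^{[i]}(w)}{p(w)}\, p^*(w) - \frac{1}{A + \tau},$$

where in the last line we used the definition $p(w) = n_\tau(w)/(A + \tau)$. Using
the formula for $p^{[2]}$ of Lemma 2, we get, for two-alternatives:

$$E[\mu_{\tau+1} - \mu_\tau \mid n_\tau] = \frac{1}{A + \tau} \sum_{w \in X} (1 + g(w, p))\, p^*(w) - \frac{1}{A + \tau} = \frac{g(p^*, p)}{A + \tau},$$

which is always non-negative. Furthermore, $g(p^*, p) = 0$ implies, by the first
part of Proposition 1, that $\mathrm{Supp}(p) \subseteq \mathrm{Supp}(p^*)$.
For three-alternatives, we have:

$$E[\mu_{\tau+1} - \mu_\tau \mid n_\tau] = -\frac{1}{A + \tau} + \sum_{w \in X} \frac{p^*(w)}{A + \tau} \Big( 1 + \tfrac{3}{2} g(w, p) + \tfrac{1}{2} g(w, p)^2 + \tfrac{1}{2} \sum_{v} p(v)\, g(w, v)\, g(v, p) \Big);$$

one can re-arrange:

$$(A + \tau)\, E[\mu_{\tau+1} - \mu_\tau \mid n_\tau] = \tfrac{3}{2} g(p^*, p) + \tfrac{1}{2} \sum_{w \in X} g(w, p)^2\, p^*(w) + \tfrac{1}{2} \sum_{v} p(v)\, g(p^*, v)\, g(v, p)$$
$$= g(p^*, p) + \tfrac{1}{2} \sum_{w \in X} g(w, p)^2\, p^*(w) + \tfrac{1}{2} \sum_{v} p(v)\, g(p^*, v)\, (1 + g(v, p)).$$

And all the terms in this sum are non-negative. The sum can be 0 only if
both $\mathrm{Supp}(p) \subseteq \mathrm{Supp}(p^*)$ and $g(w, p) = 0$ for all w in the support of p*,
which implies p = p* by the uniqueness in Proposition 1. Q.E.D.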
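The drift formulas of Proposition 6 can be checked on a small urn by computing the expected increment of the discrete-logarithm functional exactly, ball by ball. This is a sketch with exact fractions; the tournament, its optimal strategy `P_STAR` (uniform on the cycle, which satisfies the conditions of Proposition 1 for this example), the urn and all function names are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical example: a cycle A > B > C > A, plus D which loses to everyone.
X = ["A", "B", "C", "D"]
BEATS = {("A", "B"), ("B", "C"), ("C", "A"),
         ("A", "D"), ("B", "D"), ("C", "D")}
P_STAR = {"A": F(1, 3), "B": F(1, 3), "C": F(1, 3), "D": F(0)}


def g(x, y):
    return 0 if x == y else (1 if (x, y) in BEATS else -1)


def g_row(x, q):
    """g(x, q): pure strategy x against the mixed strategy q."""
    return sum(g(x, y) * q[y] for y in X)


def g_col(p, v):
    """g(p, v): mixed strategy p against the pure strategy v."""
    return sum(p[x] * g(x, v) for x in X)


def LD(a, b):
    """Discrete logarithm: minus the sum of 1/i for i = a, ..., b - 1."""
    return -sum(F(1, i) for i in range(a, b))


def mu(urn):
    total = sum(urn.values())
    return sum(LD(urn[w], total) * P_STAR[w] for w in X)


def winner_law(p, k):
    law = {x: F(0) for x in X}
    for draw in product(X, repeat=k):
        w, prob = draw[0], p[draw[0]]
        for a in draw[1:]:
            prob *= p[a]
            if (a, w) in BEATS:
                w = a
        law[w] += prob
    return law


def expected_increment(urn, k):
    """E[mu_next - mu | urn], by enumerating which ball is added."""
    total = sum(urn.values())
    p = {x: F(urn[x], total) for x in X}
    law = winner_law(p, k)
    base = mu(urn)
    return sum(law[w] * (mu({**urn, w: urn[w] + 1}) - base) for w in X)


def closed_form(urn, k):
    """Right-hand sides of Proposition 6, with A + tau equal to the number of balls."""
    total = sum(urn.values())
    p = {x: F(urn[x], total) for x in X}
    value = sum(p[v] * g_col(P_STAR, v) for v in X)            # g(p*, p)
    if k == 3:
        value += F(1, 2) * sum(g_row(w, p) ** 2 * P_STAR[w] for w in X)
        value += F(1, 2) * sum(p[v] * g_col(P_STAR, v) * (1 + g_row(v, p)) for v in X)
    return value / total


URN = {"A": 2, "B": 1, "C": 1, "D": 1}
```

For this urn, `expected_increment(URN, k)` and `closed_form(URN, k)` agree exactly for k = 2 and k = 3, and both are positive since the urn still puts weight on D, which lies outside the support of the optimal strategy.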
We are now able to prove the two results of the beginning of the section.

Proof of Theorems 4 and 5. We consider for this proof either two- or
three-alternatives reinforcement. We have:

$$E\Big[\sum_{t=1}^{\tau} E[\mu_t - \mu_{t-1} \mid n_{t-1}]\Big] = \sum_{t=1}^{\tau} E[\mu_t - \mu_{t-1}] \qquad (12)$$
$$= E[\mu_\tau] - E[\mu_0]. \qquad (13)$$

By Proposition 6, $\mu_\tau$ is a negative sub-martingale, so it converges almost
surely to an integrable random variable $\mu_\infty$ (see Corollary VII.4.1 and
VII.4.2 in [26]). Furthermore, the right hand side is an increasing negative
sequence so it converges to a finite value. In the left hand side, the sum
is an increasing sequence of positive random variables so by the monotone
convergence theorem (Theorem II.6.1 in [25]) we can take the limit inside
the expectation. Hence $E\big[\sum_{t=1}^{\infty} E[\mu_t - \mu_{t-1} \mid n_{t-1}]\big] = \lim_\tau E[\mu_\tau] - E[\mu_0]$ is
finite and so $\sum_{t=1}^{\infty} E[\mu_t - \mu_{t-1} \mid n_{t-1}]$ is almost surely finite.

Let $f^{[2]}(p) = g(p^*, p)$ and $f^{[3]}(p) = g(p^*, p) + \sum_{w \in X} g(w, p)^2\, p^*(w)$. The
simplex $\Delta(X)$ is embedded in $\mathbb{R}^X$ so we use the $L^\infty$ distance. With this
distance, $f^{[i]}$ is continuous and $d(\bar n_\tau, \bar n_{\tau+1}) \le 1/(A + \tau)$ almost surely. Denote by
$B(p, \eta)$ the ball of center p and radius $\eta$.

Now consider a single realization of the urn process. Since X is a finite
set, $\Delta(X)$ is compact, so let $\bar n_\infty$ be an accumulation point for $\bar n_\tau$. We will
show that necessarily $f^{[i]}(\bar n_\infty) = 0$. Looking for a contradiction, suppose
$f^{[i]}(\bar n_\infty) > 0$. Since $f^{[i]}$ is continuous, let $\epsilon, \eta > 0$ be such that
$\forall p \in B(\bar n_\infty, \eta)$, $f^{[i]}(p) > \epsilon$. Let $\phi$ be a sub-sequence such that
$\forall \tau$, $\bar n_{\phi(\tau)} \in B(\bar n_\infty, \eta/2)$ and $\phi(\tau + 1) > (1 + \eta)\phi(\tau)$. Then:

$$\sum_{t=0}^{\infty} E[\mu_{t+1} - \mu_t \mid n_t] \ge \sum_{t=0}^{\infty} \frac{f^{[i]}(\bar n_t)}{2(A + t)} \qquad (14)$$
$$\ge \sum_{\tau=0}^{\infty}\ \sum_{t=\phi(\tau)}^{\lfloor (1 + \eta/2)\phi(\tau) \rfloor} \frac{\epsilon}{2(A + t)} \qquad (15)$$
$$\ge \sum_{\tau=0}^{\infty} \frac{\lfloor (\eta/2)\, \phi(\tau) \rfloor\, \epsilon}{2(A + (1 + \eta/2)\phi(\tau))}. \qquad (16)$$

The right hand side of the last line is infinite. We already proved that
the left hand side is almost surely finite. This contradiction proves that all
the accumulation points of $\bar n_\tau$ are zeros of $f^{[i]}$. It follows that $f^{[i]}(\bar n_\tau) \to 0$
almost surely. We have seen in the proof of Proposition 6 that this fact
implies exactly the theorems. Q.E.D.



3.4 Two-alternatives reinforcement and variance estimates

In this section, we study in detail the two-alternatives reinforcement. The
main idea is to study the variance of $\mu_\infty$ conditionally on the state at a large
time $\tau$.

Theorem 7 For any initial urn $n_0 \in \mathcal{N}$, the random urn sequence obtained
by two-alternatives reinforcement is such that:

1. almost surely, for all alternatives x such that $p^*(x) = 0$, $\bar n_\tau(x) \to 0$
when $\tau \to \infty$;

2. with positive probability, the realized sequence $\bar n_\tau$, $\tau \in \mathbb{N}$, has no limit
as $\tau \to \infty$;

3. if T is such that $\forall x$, $p^*(x) > 0$ (in other words, BP(T) = X) then,
with probability one, $\bar n_\tau$, $\tau \in \mathbb{N}$, has no limit.

To simplify notation, in this section we let the process start at time $\tau_0$
so that $\tau$ always denotes the number of balls in the urn (i.e. A = 0). We
will also only consider two-alternatives reinforcement in this section. Recall
the notation Supp(p*) = BP. Also recall from the last section
that $\mu_\tau$ is a negative submartingale so it has an almost sure limit $\mu_\infty$. Let
$\phi = \sum_{x \in BP} p^*(x) \log p^*(x)$ be the value of $\mu_\infty$ when $\bar n_\tau$ converges to p*.

The first point is the following variance estimate:

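Lemma 8 Let $\tau > 0$ and let $\epsilon_\tau(x) = p^*(x)/\bar n_\tau(x) - 1$. We have

$$E[(\mu_{\tau+1} - \mu_\tau)^2 \mid \mathcal{F}_\tau] = \frac{1}{\tau^2} \sum_{x \in X} \bar n_\tau^{[2]}(x)\, \epsilon_\tau(x)^2. \qquad (17)$$

Proof. This is a straightforward computation:

$$E[(\mu_{\tau+1} - \mu_\tau)^2 \mid \mathcal{F}_\tau] = \sum_x \bar n_\tau^{[2]}(x) \Big( -\sum_y \frac{p^*(y)}{\tau} + \frac{p^*(x)}{n_\tau(x)} \Big)^2 \qquad (18)$$
$$= \frac{1}{\tau^2} \sum_x \bar n_\tau^{[2]}(x) \Big( \frac{p^*(x)}{\bar n_\tau(x)} - 1 \Big)^2 \qquad (19)$$
$$= \frac{1}{\tau^2} \sum_x \bar n_\tau^{[2]}(x)\, \epsilon_\tau(x)^2. \qquad (20)$$

Q.E.D.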

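The main point is the factor $1/\tau^2$, which makes the series of those terms
summable (once a small control on $\epsilon$ is provided). Thus the variance of $\mu_\infty$
conditioned on $\mathcal{F}_\tau$ will be of order $\epsilon^2/\tau$. Thus with high probability $\mu_\infty$
will be close to $\mu_\tau$, so that if $\mu_\tau$ is far enough from $\phi$, then $\mu_\infty \ne \phi$.

We will first consider the case where $BP \ne X$. In this case we have

$$E[\mu_{\tau+1} - \mu_\tau \mid \mathcal{F}_\tau] = \frac{g(p^*, \bar n_\tau)}{\tau} \ge \frac{g_0 \cdot \bar n_\tau(BP^c)}{\tau} > 0 \qquad \big(\text{where } g_0 = \inf_{y \in BP^c} g(p^*, y)\big),$$

so we need an estimate of $\bar n_\tau(BP^c)$.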

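Lemma 9 Suppose that there exists $\pi \in (0, 1)$ such that, at each time $\tau$, the
probability of adding a ball in $BP^c$ is at most $\pi \cdot \bar n_\tau(BP^c)$. Then we have,
for all $\tau \ge \tau_0$,

$$E[\bar n_\tau(BP^c) \mid \mathcal{F}_{\tau_0}] \le \prod_{t=\tau_0+1}^{\tau} \frac{t - 1 + \pi}{t}\ \bar n_{\tau_0}(BP^c)$$
$$\le \Big( \frac{\tau + 1}{\tau_0 + 1} \Big)^{\pi - 1}\ \bar n_{\tau_0}(BP^c).$$

Proof. The first line comes from a straightforward induction,
and for the second line we have

$$\log \prod_{t=\tau_0+1}^{\tau} \frac{t - 1 + \pi}{t} \le \sum_{t=\tau_0+1}^{\tau} \frac{\pi - 1}{t} \le (\pi - 1) \log\Big( \frac{\tau + 1}{\tau_0 + 1} \Big).$$

Q.E.D.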



Now we are able to prove the third point of the theorem. 
Proof of Theorem 7, case $BP^c \ne \emptyset$. Remark that for $\delta > 0$ small
enough and $\tau_0$ big enough, the set

$$S = \Big\{ n \in \mathcal{N} \text{ s.t. } n(X) \ge \tau_0 \text{ and } \Big| \phi - \sum_x p^*(x)\, LD(n(x), n(X)) \Big| \le \delta \Big\}$$

verifies

• $\forall n \in S$, $\forall x \in BP$, $|p^*(x)/\bar n(x) - 1| \le 1$

• $\forall n \in S$, $\forall x \in BP^c$, $1 + g(x, \bar n) \le \pi$

for some $\pi > 1 - \inf_{x \in BP^c} g(p^*, x)$.

Let us assume that $n_{\tau_0} \in S$ (which clearly happens with positive probability).
Let T be the first time after $\tau_0$ such that $n_T \notin S$. T is a stopping
time so $\mu_{\tau \wedge T}$ is still a submartingale; let us call it $\mu'_\tau$. Furthermore, up to
time T, we have

- Hr I ^r] = ^^^^ 

nr{BP) 



and thus 

^rl--r^--^fir,{BP). 



Now let $\mu'_\infty$ denote the almost sure limit of $\mu_{\tau \wedge T} = \mu'_\tau$. Remark that if
$|\phi - \mu'_\infty|(\omega) < \delta$ then $T(\omega) = \infty$ and thus $\mu_\infty(\omega) = \mu'_\infty(\omega)$. It is therefore
enough to show that, with positive probability, $\phi - \delta < \mu'_\infty < \phi$.

Note that $\mu'$ is a bounded submartingale, so it also converges in $L^1$ and $L^2$
toward $\mu'_\infty$. Thus



t=TO 
OO 

*=T0 
and



Var(^'^ I -^ro) ^ I -^-^o) 

t=TO 
OO 



t=TO 
OO ^ 

t=TO XGBP 

•^0 



Finally consider a tq large enough so that <C (5. It is clear that with a 
positive probability, tt-t-o ^ 5 and |^ro — is close to 5/2. Under this event 
we see that — //^| is a random variable with expectation close to (5/2 and 
variance small with respect to 6 so — (p\ has a positive probability to be 
in [5/4,35/4] which proves the theorem. Q.E.D. 

Now we turn to the case where X = BP. The idea will be similar, with
Lemma 8 being the core argument. The main simplification comes from
the fact that in this case $g(p^*, p) = 0$ for all p, so $\mu$ is a martingale and
Lemma 9 will no longer be needed. However, in order to prove that $\mu_\infty$ is
almost surely different from $\phi$, we will need an almost sure lower bound on
$|\phi - \mu_\tau|$, which will come from a careful analysis of the difference between
discrete and real logarithms. Finally, since the almost sure bound that we
will get is much worse than the one we were able to obtain with positive
probability, we will need to be more careful in our use of Lemma 8.

First recall the following well known approximation result : 

Proposition 10 Let k > 0, there exists a constant $\gamma$ (Euler's constant)
such that



^OO-j^fc-j^ -j^OO-|^ 

log{k + l) + j-- ^ j;^log(fc + l) + 7-- Y -2 



i=k+l 1=1 i=k+2 



(21) 



Furthermore when k tends to infinity, 



00 

i=k+l 
This proposition implies the following corollary:

Corollary 11 Let T be a tournament on the set X such that BP = X.
There exists c > 0 such that, for any urn $n \in \mathcal{N}$ containing $\tau$ balls,

$$\sum_{x \in X} p^*(x)\, LD(n(x), \tau) \le \phi - \frac{c}{\tau}. \qquad (23)$$

Furthermore, writing $\epsilon(x) = p^*(x)/\bar n(x) - 1$, if we restrict ourselves to large
enough $\tau$ (with $\epsilon$ staying bounded) the constant c can be taken as close as
we want to:

$$\frac{|X| - 1 + \sum_{x \in BP} \epsilon(x)}{2}. \qquad (24)$$

Proof. This is a straightforward computation using the definition of LD.

Q.E.D.
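The difference between discrete and real logarithms that drives this corollary can be observed numerically. The sketch below compares LD[a, b] with ln(a/b) and checks the elementary bounds 0 <= ln(a/b) - LD[a, b] <= 1/a, which follow from comparing the sum with the integral of 1/x; these bounds and the function names are assumptions of the illustration, not statements taken from the paper.

```python
import math


def LD(a, b):
    """Discrete logarithm: minus the sum of 1/i for i = a, ..., b - 1."""
    return -sum(1.0 / i for i in range(a, b))


def gap(a, b):
    """How far the real logarithm ln(a / b) sits above the discrete one."""
    return math.log(a / b) - LD(a, b)


def within_bounds(a, b, tol=1e-12):
    """Check 0 <= gap(a, b) <= 1 / a, up to floating-point tolerance."""
    return -tol <= gap(a, b) <= 1.0 / a + tol


SAMPLES = [(1, 2), (3, 10), (50, 1000), (200, 201)]
```

The gap is of order 1/a, that is of order 1/τ when a = n(x) is a fixed fraction of the τ balls in the urn, which is the order of the correction term c/τ in (23).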

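We also need a control of $\epsilon$ in terms of $\mu$.

Lemma 12 Almost surely, for any time $\tau$,

$$\sum_{x \in BP} [\bar n_\tau(x) + p^*(x)/2]\, \epsilon_\tau(x)^2 \le \phi - \mu_\tau. \qquad (25)$$

Proof. We have

$$\phi - \mu_\tau \ge \sum_{x \in BP} p^*(x)\, [\log p^*(x) - \log \bar n_\tau(x)] \qquad (26)$$
$$\ge \sum_{x \in BP} p^*(x)\, \big( \epsilon_\tau(x) + \epsilon_\tau(x)^2/2 \big) \qquad (27)$$
$$= \sum_{x \in BP} p^*(x)\, \epsilon_\tau(x)^2/2 + \sum_{x \in BP} \bar n_\tau(x)(1 + \epsilon_\tau(x))\, \epsilon_\tau(x) \qquad (28)$$
$$= \sum_{x \in BP} [\bar n_\tau(x) + p^*(x)/2]\, \epsilon_\tau(x)^2 \qquad (29)$$

because $\sum_{x \in BP} \bar n_\tau(x)\, \epsilon_\tau(x) = \sum_{x \in BP} (p^*(x) - \bar n_\tau(x)) = 0$. Q.E.D.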



Together, the last lemmas have the following consequence 

Lemma 13 Let $\tau_0 > 0$ be large enough and such that $|\phi - \mu_{\tau_0}| \ge 5/\tau_0$. There
exists $\pi > 0$ such that, with probability at least $\pi$,

$$\forall \tau > \tau_0, \quad |\phi - \mu_\tau| \ge \frac{1}{\tau_0}.$$

Proof. Let $d = |\phi - \mu_{\tau_0}|$ and let $T = \inf\{\tau > \tau_0 : \phi - \mu_\tau \ge 2d \text{ or } \phi - \mu_\tau \le d/5\}$.
T is a stopping time so $\mu'_\tau = \mu_{\tau \wedge T}$ is still a sub-martingale; by
definition it is also bounded so it converges almost surely and in all $L^p$. Let
$\mu'_\infty$ denote its limit. We will show that $P(\phi - \mu'_\infty \in (d/5, 2d)) > \pi > 0$ for some
$\pi$. Remark that, since $\mu$ makes vanishing steps, we are only interested in
the behaviour of $\mu$ close to $\phi$, so we can restrict ourselves to small $\epsilon$.

As long as $\tau < T$, by definition we have $\phi - \mu_\tau \le 2d$ and thus by
Lemma 12, $\sum_x (p^*(x)/2 + \bar n_\tau(x))\, \epsilon_\tau(x)^2 \le 2d$. For $\epsilon$ small enough, this
implies $\sum_x \bar n_\tau^{[2]}(x)\, \epsilon_\tau(x)^2 \le 2d$ and thus, by Lemma 8,

yT>ro,E[U+i-^H?\n<^- (30) 
Summing up to infinity (recall that fi' has constant expectation): 

oo 

Var(/i'^ - /i'^JJ^ro) = Yl ^[(^r+i ~ ^r)^l-^To] (31) 

T=ro 

^ 2o?^— (33) 
To - 1 

^ (34) 
5 

where we used the hypothesis d ^ in the last line. 

Remark that, since ^tat is bounded, E[/i'oQ] = d. Moreover notice that a 
random variable with expectation d which never takes value in the interval 
{d/2,2d) has at least variance Since Var(/z'o^|J>) ^ we have 

P(/u' G {d/2,2d)) ^ 1/10 and on this event, T = oo so — fi\ has never 
reached d/2 ^ i. Q.E.D. 



T=TO 
OO 



Proof of Theorem 7, case $BP^c = \emptyset$. First note that if $\bar n_\tau$ converges,
it has to be toward a fixed point. It is easy to see that the fixed points
of the dynamics are exactly the optimal strategies $p^*_Y$ corresponding to all
subtournaments $Y \subseteq X$. (This of course includes $p^* = p^*_X$ itself.) For
$Y \subsetneq X$, $\sum_x p^*_X(x) \log p^*_Y(x) = -\infty$; using the Markov inequality
on $\mu$ we see that convergence to those fixed points is impossible. Therefore
we only have to rule out convergence towards $p^*_X$.

We first consider the case $|X| \ge 12$. Then by Corollary 11, for any $\tau$
large enough, $\mu_\tau$ almost surely verifies the hypothesis of Lemma 13. Fix any
suitable $\tau_0$; by Lemma 13, $P(\bar n_\tau \to p^*) \le P(\exists \tau \ge \tau_0, |\phi - \mu_\tau| \le 1/\tau_0) \le 1 - \pi$.
On the event that $|\phi - \mu_\tau|$ does reach $1/\tau_0$ at time $\tau_1$, we can use Lemma 13
at time $\tau_1$ to get $P(\bar n_\tau \to p^*) \le (1 - \pi)^2$. By induction we get $P(\bar n_\tau \to p^*) = 0$,
which proves the theorem.

For the case $3 \le |X| \le 11$ (a non-trivial tournament has at least 3
elements), consider any $\tau_0$ large enough. By Corollary 11 we have $|\phi - \mu_{\tau_0}| \ge 1/\tau_0$.
Let $T = \inf\{\tau > \tau_0 : |\phi - \mu_\tau| \le 1/(2\tau_0) \text{ or } |\phi - \mu_\tau| \ge 5/\tau_0\}$. The event
$\{T = \infty \text{ or } |\phi - \mu_T| \ge 5/\tau_0\}$ has probability at least 1/9 and if $|\phi - \mu_T| \ge 5/\tau_0$
we can apply Lemma 13 at time T, so the conclusion of Lemma 13 is still true
with $1/\tau_0$ replaced by $1/(2\tau_0)$ and we can use the same induction as before to
prove the theorem. Q.E.D.



3.5 Conclusion 

We have described the behavior of learning processes designed to discover the
"best" alternatives in a tournament. Learning is achieved through the following
idea. An alternative which is considered as "good" at some date is
reinforced for the future, in the sense that one (slightly, and less and less)
increases the probability for this alternative to be considered: reinforcement
updates the sampling, or "prior", probability. The test according to which
an alternative is considered as a good one at time t rests on comparing a
few randomly chosen alternatives.

We found very different behaviors for the processes where reinforcement
occurs after sampling two or three alternatives. With three alternatives,
the process converges almost surely to a well-defined limit that
has a nice interpretation in terms of the tournament game: it is the optimal
strategy for this game. One can therefore say that this form of learning is
"successful". With two alternatives, the picture is more complicated. The
learning process "succeeds" in finding the Bipartisan Set (a set which has
been argued to be more important in terms of social choice than the numerical
values of the optimal probabilities [14]), but not the optimal probabilities
themselves. We conjecture that the almost sure non-convergence happens
for all tournaments and not only when BP = X.